
Credit Card Fraud Detection

Business Understanding

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

Dataset Description

The dataset contains transactions made by European cardholders' credit cards in September 2013. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and further background information cannot be provided. Features V1, V2, ..., V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount, which can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise.

Objective: Identify fraudulent credit card transactions.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os

# Any results you write to the current directory are saved as output.
In [2]:
# Import libraries necessary for visualization
%matplotlib inline
import matplotlib.pyplot as plt 
import seaborn as sns
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.figure_factory as ff
import warnings
warnings.filterwarnings('ignore')
In [3]:
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, accuracy_score, classification_report
In [4]:
data = pd.read_csv('creditcard.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time      284807 non-null float64
V1        284807 non-null float64
V2        284807 non-null float64
V3        284807 non-null float64
V4        284807 non-null float64
V5        284807 non-null float64
V6        284807 non-null float64
V7        284807 non-null float64
V8        284807 non-null float64
V9        284807 non-null float64
V10       284807 non-null float64
V11       284807 non-null float64
V12       284807 non-null float64
V13       284807 non-null float64
V14       284807 non-null float64
V15       284807 non-null float64
V16       284807 non-null float64
V17       284807 non-null float64
V18       284807 non-null float64
V19       284807 non-null float64
V20       284807 non-null float64
V21       284807 non-null float64
V22       284807 non-null float64
V23       284807 non-null float64
V24       284807 non-null float64
V25       284807 non-null float64
V26       284807 non-null float64
V27       284807 non-null float64
V28       284807 non-null float64
Amount    284807 non-null float64
Class     284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
In [5]:
data.isnull().sum()
Out[5]:
Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64
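As a side note, the per-column listing above can be collapsed into a single number with a second `.sum()`, which is handy as a quick sanity check. A tiny stand-in frame is used here in place of the notebook's `data`.

```python
import pandas as pd

# Sketch: a toy frame stands in for the notebook's data.
toy = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})

total_missing = toy.isnull().sum().sum()  # single count across all columns
any_missing = toy.isnull().values.any()   # boolean shortcut

print(total_missing)  # → 0
print(any_missing)    # → False
```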

We know that all features except 'Amount' and 'Time' are principal components derived through PCA, and from this initial inspection there appear to be no missing values in the dataset.

Let's check the distribution of the target variable, denoted by the label 'Class', in the dataset.

In [6]:
data['Class'].value_counts()
Out[6]:
0    284315
1       492
Name: Class, dtype: int64
In [7]:
# Let's print the proportion of target variable in the dataset
print('No Frauds', round(data['Class'].value_counts()[0]/len(data) * 100,2), '% of the dataset')
print('Frauds', round(data['Class'].value_counts()[1]/len(data) * 100,2), '% of the dataset')
No Frauds 99.83 % of the dataset
Frauds 0.17 % of the dataset
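The same proportions can be obtained more concisely with `value_counts(normalize=True)`, avoiding the manual division by `len(data)`. A minimal sketch, using a tiny stand-in frame rather than the notebook's `data`:

```python
import pandas as pd

# Toy stand-in: 997 normal transactions, 3 frauds.
data = pd.DataFrame({'Class': [0] * 997 + [1] * 3})

# normalize=True returns proportions directly; multiply by 100 for percentages.
proportions = data['Class'].value_counts(normalize=True) * 100
print(proportions.round(2))
```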
In [8]:
labels = ['Normal', 'Fraud']
values = data['Class'].value_counts()
colors = ['rgb(32, 148, 159)','#FEBFB3']
trace = go.Pie(labels=labels, values=values,
               hoverinfo='label+percent', textinfo='value', 
               textfont=dict(size=20),
               marker=dict(colors=colors, 
                           line=dict(color='#000000', width=0.5)))

py.iplot([trace], filename='styled_pie_chart')
In [9]:
data.shape
Out[9]:
(284807, 31)
In [10]:
data['Amount'].groupby(data['Class']).describe()
Out[10]:
count mean std min 25% 50% 75% max
Class
0 284315.0 88.291022 250.105092 0.0 5.65 22.00 77.05 25691.16
1 492.0 122.211321 256.683288 0.0 1.00 9.25 105.89 2125.87
In [11]:
from sklearn.preprocessing import StandardScaler

data['normalizedAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
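Since the box plots below show that transaction amounts are heavily right-skewed, an alternative worth considering is `RobustScaler`, which centres on the median and scales by the interquartile range, so the few very large amounts distort the scaling less than with `StandardScaler`. A sketch with illustrative toy amounts:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy amounts with one extreme value, mimicking the long tail of 'Amount'.
amounts = np.array([[1.0], [5.0], [20.0], [80.0], [25000.0]])

# RobustScaler: (x - median) / IQR, so the median maps to 0.
scaled = RobustScaler().fit_transform(amounts)
print(scaled.ravel())
```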
In [12]:
from plotly import tools
Normal_amt = data['Amount'].loc[data['Class']==0]
fraud_amt = data['Amount'].loc[data['Class']==1]
normal_class =data['Class'].loc[data['Class']==0]
fraud_class = data['Class'].loc[data['Class']==1]
trace0 = go.Box(
    y= Normal_amt,
    x= normal_class,
    name='Not Fraud',
    marker=dict(
        color='#3D9970'
    )
)
trace1 = go.Box(
    y= fraud_amt,
    x= fraud_class,
    name='Fraud',
    marker=dict(
        color='#FF4136'
    )
)
trace2 = go.Box(
    y= np.log(Normal_amt),
    x= normal_class,
    name='Not Fraud Log Amount',
    marker=dict(
        color= 'rgb(32, 148, 159)'
    )
)
trace3 = go.Box(
    y= np.log(fraud_amt),
    x= fraud_class,
    name='Fraud Log Amount ',
    marker=dict(
        color='#FEBFB3'
    )
)

fig = tools.make_subplots(rows=2, cols=2, shared_yaxes=True)

fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig.append_trace(trace2, 2, 1)
fig.append_trace(trace3, 2, 2)

fig['layout'].update(height=700, width=800,
                     title='Box Plot Target Variable vs Transaction Amount')
py.iplot(fig, filename='multiple-subplots-shared-yaxes')
This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y1 ]
[ (2,1) x3,y2 ]  [ (2,2) x4,y2 ]

In [13]:
data.shape
Out[13]:
(284807, 32)
In [14]:
data = data.drop(['Time','Amount'],axis=1)
data.head()
Out[14]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V21 V22 V23 V24 V25 V26 V27 V28 Class normalizedAmount
0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 0.090794 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 0 0.244964
1 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 -0.166974 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 0 -0.342475
2 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 0.207643 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 0 1.160686
3 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 -0.054952 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 0 0.140534
4 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 0.753074 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 0 -0.073403

5 rows × 30 columns

In [15]:
X = data.drop('Class', axis=1)
y = data['Class']
In [16]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
skfold = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)
In [17]:
for train_index, test_index in skfold.split(X, y):
    print("Train:", train_index, "Test:", test_index)
    X_train,X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]




# Note: only the split from the final fold is kept after the loop above

# Turn the DataFrames into numpy arrays
X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values

# See if both the train and test label distribution are similarly distributed
train_unique_label, train_counts_label = np.unique(y_train, return_counts=True)
test_unique_label, test_counts_label = np.unique(y_test, return_counts=True)
print('-' * 100)

print('Label Distributions: \n')
print('The training set distribution is {}'.format(train_counts_label/ len(y_train)))
print('The test set distribution is {}'.format(test_counts_label/ len(y_test)))
Train: [ 30473  30496  31002 ... 284804 284805 284806] Test: [    0     1     2 ... 57017 57018 57019]
Train: [     0      1      2 ... 284804 284805 284806] Test: [ 30473  30496  31002 ... 113964 113965 113966]
Train: [     0      1      2 ... 284804 284805 284806] Test: [ 81609  82400  83053 ... 170946 170947 170948]
Train: [     0      1      2 ... 284804 284805 284806] Test: [150654 150660 150661 ... 227866 227867 227868]
Train: [     0      1      2 ... 227866 227867 227868] Test: [212516 212644 213092 ... 284804 284805 284806]
----------------------------------------------------------------------------------------------------
Label Distributions: 

The training set distribution is [0.99827076 0.00172924]
The test set distribution is [0.99827952 0.00172048]
In [18]:
print('Length of X (train): {} | Length of y (train): {}'.format(len(X_train), len(y_train)))
print('Length of X (test): {} | Length of y (test): {}'.format(len(X_test), len(y_test)))
Length of X (train): 227846 | Length of y (train): 227846
Length of X (test): 56961 | Length of y (test): 56961
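If only a single hold-out split is needed (the loop above ends up keeping just the last fold), `train_test_split` with `stratify` preserves the fraud ratio in one call. A sketch with toy arrays standing in for the notebook's `X` and `y`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 2000 samples, 1% positive class, mimicking heavy imbalance.
X_toy = np.arange(2000).reshape(-1, 1)
y_toy = np.array([0] * 1980 + [1] * 20)

# stratify=y_toy keeps the class ratio identical in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42)
print(y_tr.mean(), y_te.mean())  # both ≈ 0.01
```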
In [19]:
from imblearn.combine import SMOTEENN
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline

from pprint import pprint
accuracy_lst = []
precision_lst = []
recall_lst = []
f1_lst = []
auc_lst = []

# Initialize the random forest model with a heavier class weight on the minority (fraud) class,
# so missed frauds are penalised more during training
clf_rf = RandomForestClassifier(n_estimators =50,oob_score = False,
                                random_state=42,class_weight= {1:10}, n_jobs=-1)
print('Parameters currently in use:\n')
pprint(clf_rf.get_params())
Using TensorFlow backend.
Parameters currently in use:

{'bootstrap': True,
 'class_weight': {1: 10},
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 50,
 'n_jobs': -1,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}
In [20]:
# Implementing the SMOTEENN technique (SMOTE oversampling followed by ENN cleaning)
# Cross-validating the right way: resampling happens inside each fold, not before splitting
# Parameters

# Number of estimators
n_estimators = [30,50]
# Number of features to consider at every split
max_features = ['sqrt','log2']
# Maximum number of levels in tree
max_depth = [6, 9, 15]

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth}


rand_rf = RandomizedSearchCV(clf_rf, param_distributions=random_grid, n_iter=4,
                                  random_state=42, n_jobs=-1, return_train_score=False)
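One refinement worth noting: with only 0.17% frauds, the default accuracy scoring barely distinguishes candidates, so `scoring='average_precision'` (area under the precision-recall curve) is a common choice when searching on imbalanced data. A minimal sketch, with an illustrative toy dataset and a small parameter grid rather than the notebook's:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Toy imbalanced dataset standing in for the notebook's training data.
X_toy, y_toy = make_classification(n_samples=400, weights=[0.9],
                                   random_state=0)

# scoring='average_precision' ranks candidates by PR-curve area.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={'n_estimators': [10, 20], 'max_depth': [3, 5]},
    n_iter=2, scoring='average_precision', cv=3, random_state=0)
search.fit(X_toy, y_toy)
print(search.best_params_)
```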
In [21]:
for train, test in skfold.split(X_train, y_train):
    pipeline = imbalanced_make_pipeline(SMOTEENN(sampling_strategy='minority'), rand_rf) 
    # SMOTEENN resampling happens inside each cross-validation fold, not before
    model = pipeline.fit(X_train[train], y_train[train])
    best_est = rand_rf.best_estimator_
    prediction = best_est.predict(X_train[test])
    
    accuracy_lst.append(pipeline.score(X_train[test], y_train[test]))
    precision_lst.append(precision_score(y_train[test], prediction))
    recall_lst.append(recall_score(y_train[test], prediction))
    f1_lst.append(f1_score(y_train[test], prediction))
    auc_lst.append(roc_auc_score(y_train[test], prediction))
    
print('---' * 45)
print('')
print("accuracy: {}".format(np.mean(accuracy_lst)))
print("precision: {}".format(np.mean(precision_lst)))
print("recall: {}".format(np.mean(recall_lst)))
print("f1: {}".format(np.mean(f1_lst)))
print('---' * 45)
---------------------------------------------------------------------------------------------------------------------------------------

accuracy: 0.9941408157995291
precision: 0.23668067711702018
recall: 0.832262252515417
f1: 0.3607341603502758
---------------------------------------------------------------------------------------------------------------------------------------
In [22]:
print('The best estimator for random forest with SMOTEENN is {}'.format(best_est))
The best estimator for random forest with SMOTEENN is RandomForestClassifier(bootstrap=True, class_weight={1: 10}, criterion='gini',
            max_depth=15, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=30, n_jobs=-1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)
In [23]:
import pickle
rf_filename = 'rf_best_estimator.sav'
pickle.dump(best_est, open(rf_filename, 'wb'))
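An alternative to raw `pickle` is `joblib`, which the scikit-learn documentation recommends for persisting fitted estimators because it handles large numpy arrays more efficiently. A sketch, using a freshly constructed model and a temp-directory path in place of `best_est` and the notebook's filename:

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier

# A small unfitted model stands in for the notebook's best_est.
clf = RandomForestClassifier(n_estimators=5, random_state=0)

# Dump and reload with joblib instead of pickle.
path = os.path.join(tempfile.gettempdir(), 'rf_best_estimator.joblib')
joblib.dump(clf, path)
restored = joblib.load(path)
print(restored.n_estimators)  # → 5
```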
In [24]:
if os.path.isfile('rf_best_estimator.sav'):
    clf_rf = pickle.load(open('rf_best_estimator.sav', 'rb'))
    print("Fitted random forest model has been loaded from pickle file. Run prediction on dataset")
else:
    print('Pickle object not found. Train the model and dump the fitted model using pickle')
Fitted random forest model has been loaded from pickle file. Run prediction on dataset
In [25]:
from imblearn.ensemble import BalancedRandomForestClassifier
brf = BalancedRandomForestClassifier(n_estimators=200, replacement= True,max_features =0.3,
                                     random_state=0,class_weight='balanced_subsample',max_depth = 25,
                                     n_jobs=-1)
In [26]:
brf.fit(X_train, y_train)
Out[26]:
BalancedRandomForestClassifier(bootstrap=True,
                class_weight='balanced_subsample', criterion='gini',
                max_depth=25, max_features=0.3, max_leaf_nodes=None,
                min_impurity_decrease=0.0, min_samples_leaf=2,
                min_samples_split=2, min_weight_fraction_leaf=0.0,
                n_estimators=200, n_jobs=-1, oob_score=False,
                random_state=0, replacement=True, sampling_strategy='auto',
                verbose=0, warm_start=False)
In [27]:
y_pred_brf = brf.predict(X_test)
In [28]:
from sklearn.metrics import balanced_accuracy_score
from imblearn.metrics import geometric_mean_score
from sklearn.metrics import confusion_matrix
import itertools
In [29]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    # Normalise before plotting so the image and the annotations agree
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
In [30]:
fig, ax = plt.subplots(ncols=1)
print('Balanced Random Forest classifier performance:')
print('Balanced accuracy: {:.2f} - Geometric mean {:.2f}'
      .format(balanced_accuracy_score(y_test, y_pred_brf),
              geometric_mean_score(y_test, y_pred_brf)))
cm_brf = confusion_matrix(y_test, y_pred_brf)
plot_confusion_matrix(cm_brf, classes=np.unique(data['Class']),
                      title='Balanced random forest')
Balanced Random Forest classifier performance:
Balanced accuracy: 0.93 - Geometric mean 0.93
In [31]:
from sklearn.metrics import classification_report
labels = ['No Fraud', 'Fraud']
print(classification_report(y_test, y_pred_brf, target_names=labels))
              precision    recall  f1-score   support

    No Fraud       1.00      0.98      0.99     56863
       Fraud       0.06      0.88      0.11        98

   micro avg       0.98      0.98      0.98     56961
   macro avg       0.53      0.93      0.55     56961
weighted avg       1.00      0.98      0.99     56961

In [32]:
from sklearn.metrics import average_precision_score
y_score = brf.predict_proba(X_test)[:,1]
average_precision = average_precision_score(y_test, y_score)

print('Average precision-recall score: {0:0.2f}'.format(
      average_precision))
Average precision-recall score: 0.81
In [33]:
from sklearn.metrics import precision_recall_curve
fig = plt.figure(figsize=(12,6))

precision, recall, _ = precision_recall_curve(y_test, y_score)

plt.step(recall, precision, color='r', alpha=0.2,
         where='post')
plt.fill_between(recall, precision, step='post', alpha=0.2,
                 color='#F59B00')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('Balanced Random Forest Precision-Recall curve: \n Average Precision-Recall Score ={0:0.2f}'.format(
          average_precision), fontsize=16)
Out[33]:
Text(0.5, 1.0, 'Balanced Random Forest Precision-Recall curve: \n Average Precision-Recall Score =0.81')
In [34]:
thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]

plt.figure(figsize=(10,10))

j = 1
for i in thresholds:
    y_test_predictions_high_recall = y_score > i
    
    plt.subplot(3,3,j)
    j += 1
    
    # Compute confusion matrix
    cnf_matrix = confusion_matrix(y_test,y_test_predictions_high_recall)
    np.set_printoptions(precision=2)
    print('Threshold {}'.format(i))
    print("Precision metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,1]+cnf_matrix[0,1]))
    print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

    # Plot non-normalized confusion matrix
    class_names = ['0','1']
    plot_confusion_matrix(cnf_matrix
                          , classes=class_names
                          , title='Threshold >= %s'%i) 
Threshold 0.1
Precision metric in the testing dataset:  0.005244660719113274
Recall metric in the testing dataset:  0.9897959183673469
Threshold 0.2
Precision metric in the testing dataset:  0.00953879437506146
Recall metric in the testing dataset:  0.9897959183673469
Threshold 0.3
Precision metric in the testing dataset:  0.017938931297709924
Recall metric in the testing dataset:  0.9591836734693877
Threshold 0.4
Precision metric in the testing dataset:  0.03397683397683398
Recall metric in the testing dataset:  0.8979591836734694
Threshold 0.5
Precision metric in the testing dataset:  0.06125356125356125
Recall metric in the testing dataset:  0.8775510204081632
Threshold 0.6
Precision metric in the testing dataset:  0.152014652014652
Recall metric in the testing dataset:  0.8469387755102041
Threshold 0.7
Precision metric in the testing dataset:  0.3025830258302583
Recall metric in the testing dataset:  0.8367346938775511
Threshold 0.8
Precision metric in the testing dataset:  0.6446280991735537
Recall metric in the testing dataset:  0.7959183673469388
Threshold 0.9
Precision metric in the testing dataset:  0.9
Recall metric in the testing dataset:  0.7346938775510204
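Instead of looping over a fixed list of cut-offs, `precision_recall_curve` computes precision and recall at every distinct score in one vectorised call. A sketch with toy labels and scores standing in for `y_test` and `y_score`:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and predicted scores standing in for the notebook's arrays.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.9, 0.8, 0.35, 0.7])

# One call yields the full precision/recall trade-off curve.
precision, recall, thresholds = precision_recall_curve(y_true, scores)
for p, r, t in zip(precision, recall, thresholds):
    print('threshold {:.2f}: precision {:.2f}, recall {:.2f}'.format(t, p, r))
```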
In [35]:
features =['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
       'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28','normalizedAmount']
In [36]:
plt.figure(figsize = (9,5))

feat_import = pd.DataFrame({'Feature': features, 'Feature importance': brf.feature_importances_})
feat_import = feat_import.sort_values(by='Feature importance',ascending=False)

g = sns.barplot(x='Feature',y='Feature importance',data=feat_import)
g.set_xticklabels(g.get_xticklabels(),rotation=90)
g.set_title('Features importance - Random Forest',fontsize=20)
plt.show() 
In [37]:
from keras.models import Sequential
from keras.layers import Activation, Dense, Dropout, BatchNormalization
from keras.optimizers import Adam
from keras.metrics import categorical_crossentropy


def make_model(n_features):
    model = Sequential()
    model.add(Dense(200, input_shape=(n_features,),
              kernel_initializer='glorot_normal'))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    model.add(Dense(100, kernel_initializer='glorot_normal'))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.25))
    model.add(Dense(50, kernel_initializer='glorot_normal'))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.15))
    model.add(Dense(25, kernel_initializer='glorot_normal'))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.1))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    return model
In [38]:
import time
from functools import wraps


def timeit(f):
    @wraps(f)
    def wrapper(*args, **kwds):
        start_time = time.time()
        result = f(*args, **kwds)
        elapsed_time = time.time() - start_time
        print('Elapsed computation time: {:.3f} secs'
              .format(elapsed_time))
        return (elapsed_time, result)
    return wrapper
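A quick usage example of the decorator: note that the wrapper returns a tuple of `(elapsed_time, original_result)`, so call sites must unpack both. The decorator is repeated here so the snippet is self-contained; `add` is an illustrative function.

```python
import time
from functools import wraps

# Same timeit decorator as in the cell above.
def timeit(f):
    @wraps(f)
    def wrapper(*args, **kwds):
        start_time = time.time()
        result = f(*args, **kwds)
        elapsed_time = time.time() - start_time
        print('Elapsed computation time: {:.3f} secs'.format(elapsed_time))
        return (elapsed_time, result)
    return wrapper

@timeit
def add(a, b):
    return a + b

# The decorated call returns (elapsed_time, result).
elapsed, total = add(2, 3)
print(total)  # → 5
```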
In [39]:
@timeit
def fit_predict_imbalanced_model(X_train, y_train, X_test, y_test):
    model = make_model(X_train.shape[1])
    print(model.summary())
    model.fit(X_train, y_train, epochs=2, verbose=1, batch_size=200)
    y_pred = model.predict_proba(X_test, batch_size=200).ravel()  # sigmoid output has shape (n, 1)
    return roc_auc_score(y_test, y_pred)
In [40]:
from imblearn.keras import BalancedBatchGenerator


@timeit
def fit_predict_balanced_model(X_train, y_train, X_test, y_test):
    model = make_model(X_train.shape[1])
    training_generator = BalancedBatchGenerator(X_train, y_train,
                                                batch_size=200,
                                                random_state=42)
    model.fit_generator(generator=training_generator, epochs=5, verbose=1)
    y_pred = model.predict_proba(X_test, batch_size=200).ravel()  # sigmoid output has shape (n, 1)
    return roc_auc_score(y_test, y_pred)
In [43]:
from imblearn.keras import balanced_batch_generator
from imblearn.under_sampling import NeighbourhoodCleaningRule 
In [44]:
model = make_model(X_train.shape[1])
In [45]:
skf = StratifiedKFold(n_splits=5)


# Note: only the split from the final fold is kept after this loop
for train_idx, valid_idx in skf.split(X_train, y_train):
    X_local_train = X_train[train_idx]
    y_local_train = y_train[train_idx]
    X_local_test = X_train[valid_idx]
    y_local_test = y_train[valid_idx]
In [46]:
training_generator = BalancedBatchGenerator(X_local_train, y_local_train,
                                            batch_size=100,
                                            random_state=42)
model.fit_generator(generator=training_generator, epochs=10,validation_data=(X_local_test, y_local_test) ,verbose=1)
Epoch 1/10
6/6 [==============================] - 2s 409ms/step - loss: 0.8632 - acc: 0.5317 - val_loss: 0.3164 - val_acc: 0.9411
Epoch 2/10
6/6 [==============================] - 1s 117ms/step - loss: 0.4400 - acc: 0.7983 - val_loss: 0.1521 - val_acc: 0.9878
Epoch 3/10
6/6 [==============================] - 1s 116ms/step - loss: 0.3362 - acc: 0.8567 - val_loss: 0.0968 - val_acc: 0.9919
Epoch 4/10
6/6 [==============================] - 1s 116ms/step - loss: 0.2803 - acc: 0.9033 - val_loss: 0.0752 - val_acc: 0.9912
Epoch 5/10
6/6 [==============================] - 1s 117ms/step - loss: 0.2379 - acc: 0.9083 - val_loss: 0.0626 - val_acc: 0.9895
Epoch 6/10
6/6 [==============================] - 1s 116ms/step - loss: 0.2246 - acc: 0.9250 - val_loss: 0.0575 - val_acc: 0.9881
Epoch 7/10
6/6 [==============================] - 1s 118ms/step - loss: 0.2218 - acc: 0.9350 - val_loss: 0.0541 - val_acc: 0.9872
Epoch 8/10
6/6 [==============================] - 1s 117ms/step - loss: 0.1879 - acc: 0.9350 - val_loss: 0.0530 - val_acc: 0.9858
Epoch 9/10
6/6 [==============================] - 1s 117ms/step - loss: 0.1810 - acc: 0.9350 - val_loss: 0.0538 - val_acc: 0.9842
Epoch 10/10
6/6 [==============================] - 1s 120ms/step - loss: 0.1780 - acc: 0.9367 - val_loss: 0.0527 - val_acc: 0.9833
Out[46]:
<keras.callbacks.History at 0x7f26ae0e42b0>
In [47]:
training_generator, steps_per_epoch = balanced_batch_generator(X_local_train, y_local_train, 
                            batch_size=100, sampler=NeighbourhoodCleaningRule(),random_state=42)
model.fit_generator(generator=training_generator, steps_per_epoch=steps_per_epoch,
                                     epochs=20,validation_data= (X_local_test, y_local_test), verbose=1)
Epoch 1/20
1821/1821 [==============================] - 11s 6ms/step - loss: 0.0787 - acc: 0.9776 - val_loss: 0.0035 - val_acc: 0.9995
Epoch 2/20
1821/1821 [==============================] - 11s 6ms/step - loss: 0.0042 - acc: 0.9993 - val_loss: 0.0030 - val_acc: 0.9995
Epoch 3/20
1821/1821 [==============================] - 12s 6ms/step - loss: 0.0039 - acc: 0.9993 - val_loss: 0.0028 - val_acc: 0.9995
Epoch 4/20
1821/1821 [==============================] - 10s 6ms/step - loss: 0.0036 - acc: 0.9993 - val_loss: 0.0030 - val_acc: 0.9995
Epoch 5/20
1821/1821 [==============================] - 10s 6ms/step - loss: 0.0033 - acc: 0.9993 - val_loss: 0.0030 - val_acc: 0.9996
Epoch 6/20
1821/1821 [==============================] - 10s 6ms/step - loss: 0.0031 - acc: 0.9994 - val_loss: 0.0030 - val_acc: 0.9995
Epoch 7/20
1821/1821 [==============================] - 10s 6ms/step - loss: 0.0031 - acc: 0.9994 - val_loss: 0.0031 - val_acc: 0.9995
Epoch 8/20
1821/1821 [==============================] - 10s 6ms/step - loss: 0.0030 - acc: 0.9994 - val_loss: 0.0030 - val_acc: 0.9995
Epoch 9/20
1821/1821 [==============================] - 10s 6ms/step - loss: 0.0027 - acc: 0.9994 - val_loss: 0.0031 - val_acc: 0.9995
Epoch 10/20
1821/1821 [==============================] - 10s 6ms/step - loss: 0.0027 - acc: 0.9995 - val_loss: 0.0031 - val_acc: 0.9995
Epoch 11/20
1821/1821 [==============================] - 10s 6ms/step - loss: 0.0026 - acc: 0.9995 - val_loss: 0.0033 - val_acc: 0.9994
Epoch 12/20
1821/1821 [==============================] - 10s 6ms/step - loss: 0.0026 - acc: 0.9995 - val_loss: 0.0032 - val_acc: 0.9995
Epoch 13/20
1821/1821 [==============================] - 10s 6ms/step - loss: 0.0025 - acc: 0.9994 - val_loss: 0.0033 - val_acc: 0.9995
Epoch 14/20
1821/1821 [==============================] - 10s 6ms/step - loss: 0.0023 - acc: 0.9994 - val_loss: 0.0033 - val_acc: 0.9995
Epoch 15/20
1821/1821 [==============================] - 10s 6ms/step - loss: 0.0023 - acc: 0.9995 - val_loss: 0.0034 - val_acc: 0.9995
Epoch 16/20
1821/1821 [==============================] - 10s 6ms/step - loss: 0.0024 - acc: 0.9994 - val_loss: 0.0034 - val_acc: 0.9995
Epoch 17/20
1821/1821 [==============================] - 10s 5ms/step - loss: 0.0020 - acc: 0.9995 - val_loss: 0.0035 - val_acc: 0.9995
Epoch 18/20
1821/1821 [==============================] - 10s 5ms/step - loss: 0.0021 - acc: 0.9995 - val_loss: 0.0034 - val_acc: 0.9996
Epoch 19/20
1821/1821 [==============================] - 10s 6ms/step - loss: 0.0022 - acc: 0.9994 - val_loss: 0.0035 - val_acc: 0.9995
Epoch 20/20
1821/1821 [==============================] - 10s 6ms/step - loss: 0.0021 - acc: 0.9995 - val_loss: 0.0035 - val_acc: 0.9995
Out[47]:
<keras.callbacks.History at 0x7f26ae0e47b8>

@timeit
def fit_balanced_model(X_train, y_train, X_test, y_test):
    model = make_model(X_train.shape[1])
    training_generator, steps_per_epoch = balanced_batch_generator(
        X_train, y_train, batch_size=100, random_state=42)
    model.fit_generator(generator=training_generator,
                        steps_per_epoch=steps_per_epoch,
                        epochs=10, verbose=1)
    return model

In [48]:
y_pred_keras = model.predict_classes(X_test, batch_size=100, verbose=0)
In [49]:
y_pred_keras.shape
Out[49]:
(56961, 1)
In [50]:
#y_pred_keras =keras_model.predict_classes(X_test)
labels = ['No Fraud', 'Fraud']
print(classification_report(y_test, y_pred_keras, target_names=labels))
              precision    recall  f1-score   support

    No Fraud       1.00      1.00      1.00     56863
       Fraud       0.90      0.74      0.82        98

   micro avg       1.00      1.00      1.00     56961
   macro avg       0.95      0.87      0.91     56961
weighted avg       1.00      1.00      1.00     56961

In [51]:
cm_keras = confusion_matrix(y_test, y_pred_keras)
plot_confusion_matrix(cm_keras, classes=np.unique(data['Class']),
                      title='Keras Model')
In [52]:
from sklearn.metrics import average_precision_score
#y_keras_score = keras_model.predict_proba(X_test)[:,1]
y_keras_score = model.predict_proba(X_test, batch_size=100)
keras_average_precision = average_precision_score(y_test, y_keras_score)

print('Average precision-recall score: {0:0.2f}'.format(
      keras_average_precision))
Average precision-recall score: 0.82
In [53]:
thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]

plt.figure(figsize=(10,10))

j = 1
for i in thresholds:
    y_keras_recall = y_keras_score > i
    
    plt.subplot(3,3,j)
    j += 1
    
    # Compute confusion matrix
    cnf_matrix = confusion_matrix(y_test,y_keras_recall)
    np.set_printoptions(precision=2)
    print('Threshold {}'.format(i))
    print("Precision metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,1]+cnf_matrix[0,1]))
    print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

    # Plot non-normalized confusion matrix
    class_names = ['0','1']
    plot_confusion_matrix(cnf_matrix
                          , classes=class_names
                          , title='Threshold >= %s'%i) 
Threshold 0.1
Precision metric in the testing dataset:  0.8020833333333334
Recall metric in the testing dataset:  0.7857142857142857
Threshold 0.2
Precision metric in the testing dataset:  0.8369565217391305
Recall metric in the testing dataset:  0.7857142857142857
Threshold 0.3
Precision metric in the testing dataset:  0.8522727272727273
Recall metric in the testing dataset:  0.7653061224489796
Threshold 0.4
Precision metric in the testing dataset:  0.8690476190476191
Recall metric in the testing dataset:  0.7448979591836735
Threshold 0.5
Precision metric in the testing dataset:  0.9012345679012346
Recall metric in the testing dataset:  0.7448979591836735
Threshold 0.6
Precision metric in the testing dataset:  0.935064935064935
Recall metric in the testing dataset:  0.7346938775510204
Threshold 0.7
Precision metric in the testing dataset:  0.9466666666666667
Recall metric in the testing dataset:  0.7244897959183674
Threshold 0.8
Precision metric in the testing dataset:  0.971830985915493
Recall metric in the testing dataset:  0.7040816326530612
Threshold 0.9
Precision metric in the testing dataset:  1.0
Recall metric in the testing dataset:  0.5816326530612245
In [54]:
fig = plt.figure(figsize=(12,6))

precision, recall, _ = precision_recall_curve(y_test, y_keras_score)

plt.step(recall, precision, color='r', alpha=0.2,
         where='post')
plt.fill_between(recall, precision, step='post', alpha=0.2,
                 color='#F59B00')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('OverSampling Precision-Recall curve: \n Average Precision-Recall Score ={0:0.2f}'.format(
          keras_average_precision), fontsize=16)
Out[54]:
Text(0.5, 1.0, 'OverSampling Precision-Recall curve: \n Average Precision-Recall Score =0.82')
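The `keras_average_precision` value in the title summarizes this curve as a single number: the step-wise area under it, AP = Σₙ (Rₙ − Rₙ₋₁) · Pₙ. A small sketch with toy labels and scores (not the notebook's data) showing that sklearn's `average_precision_score` matches that sum computed by hand:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Toy example: 4 samples, 2 positives, arbitrary scores
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, _ = precision_recall_curve(y_true, scores)
ap = average_precision_score(y_true, scores)

# recall comes back in decreasing order, so the diffs are negated
manual = -np.sum(np.diff(recall) * precision[:-1])
print(round(ap, 4), round(manual, 4))
```

Unlike ROC AUC, average precision ignores true negatives entirely, which is why it is the more informative summary on a dataset where 99.8% of transactions are legitimate.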
In [55]:
from sklearn import metrics
from sklearn.metrics import matthews_corrcoef
In [56]:
def classifier_metrics(model, estimator, actual, y_pred, proba):
    # Calculate classification metrics from the predicted labels and
    # probabilities (`estimator` is accepted for API symmetry but unused)
    class_metrics = {
        'Accuracy': metrics.accuracy_score(actual, y_pred),
        'Precision': metrics.precision_score(actual, y_pred),
        'Recall': metrics.recall_score(actual, y_pred),
        'F1 Score': metrics.f1_score(actual, y_pred),
        'ROC AUC': metrics.roc_auc_score(actual, proba),
        'Matthews Correlation Coefficient': matthews_corrcoef(actual, y_pred)
    }

    df_metrics = pd.DataFrame.from_dict(class_metrics, orient='index')
    df_metrics.columns = [model]

    print('\n' + model + '  Metrics:')
    
    print(df_metrics)

    return  df_metrics
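The Matthews Correlation Coefficient is included above because accuracy alone is misleading at this class ratio. A hedged sketch on hypothetical data (1,000 samples, not the notebook's) showing the failure mode:

```python
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef

# ~0.2% positives, similar in spirit to the 0.172% fraud rate here
y_true = np.array([1, 1] + [0] * 998)
y_pred = np.zeros(1000, dtype=int)      # the "always legitimate" classifier

print(accuracy_score(y_true, y_pred))   # 0.998 despite catching zero frauds
print(matthews_corrcoef(y_true, y_pred))  # collapses to 0.0
```

MCC uses all four cells of the confusion matrix, so a degenerate classifier scores 0 regardless of the class balance, which is exactly the property the near-perfect accuracy numbers below cannot give.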
In [57]:
classifier_metrics('keras',model,y_test, y_pred_keras, y_keras_score)
keras  Metrics:
                                     keras
Accuracy                          0.999421
Precision                         0.901235
Recall                            0.744898
F1 Score                          0.815642
ROC AUC                           0.950933
Matthews Correlation Coefficient  0.819069
Out[57]:
                                     keras
Accuracy                          0.999421
Precision                         0.901235
Recall                            0.744898
F1 Score                          0.815642
ROC AUC                           0.950933
Matthews Correlation Coefficient  0.819069
In [58]:
print(model.summary())
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_6 (Dense)              (None, 200)               6000      
_________________________________________________________________
activation_5 (Activation)    (None, 200)               0         
_________________________________________________________________
batch_normalization_5 (Batch (None, 200)               800       
_________________________________________________________________
dropout_5 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_7 (Dense)              (None, 100)               20100     
_________________________________________________________________
activation_6 (Activation)    (None, 100)               0         
_________________________________________________________________
batch_normalization_6 (Batch (None, 100)               400       
_________________________________________________________________
dropout_6 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_8 (Dense)              (None, 50)                5050      
_________________________________________________________________
activation_7 (Activation)    (None, 50)                0         
_________________________________________________________________
batch_normalization_7 (Batch (None, 50)                200       
_________________________________________________________________
dropout_7 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_9 (Dense)              (None, 25)                1275      
_________________________________________________________________
activation_8 (Activation)    (None, 25)                0         
_________________________________________________________________
batch_normalization_8 (Batch (None, 25)                100       
_________________________________________________________________
dropout_8 (Dropout)          (None, 25)                0         
_________________________________________________________________
dense_10 (Dense)             (None, 1)                 26        
=================================================================
Total params: 33,951
Trainable params: 33,201
Non-trainable params: 750
_________________________________________________________________
None
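The Param # column above can be reproduced by hand. The first Dense layer reports 6,000 parameters for 200 units, which implies (n_in + 1) × 200 = 6000, i.e. 29 input features (inferred, not stated in this cell). A sketch of the arithmetic:

```python
# Layer widths: inferred 29 inputs, then the Dense sizes shown in the summary
layer_sizes = [29, 200, 100, 50, 25, 1]

# Dense: weights + biases = (n_in + 1) * n_out per layer
dense = sum((i + 1) * o for i, o in zip(layer_sizes[:-1], layer_sizes[1:]))

# BatchNormalization: gamma + beta (trainable) plus moving mean + variance
# (non-trainable), i.e. 4 params per unit, on the four hidden layers
bn = sum(4 * u for u in layer_sizes[1:-1])
non_trainable = sum(2 * u for u in layer_sizes[1:-1])

total = dense + bn
print(total, total - non_trainable, non_trainable)  # 33951 33201 750
```

This matches the summary exactly: 33,951 total, of which the 750 non-trainable parameters are the batch-norm moving statistics.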